To enable a safe and effective human-robot cooperation, it is crucial to develop models for the identification of human activities. Egocentric vision seems to be a viable solution to solve this problem, and therefore many works provide deep learning solutions to infer human actions from first person videos. However, although very promising, most of these do not consider the major challenges that comes with a realistic deployment, such as the portability of the model, the need for real-time inference, and the robustness with respect to the novel domains (i.e., new spaces, users, tasks). With this paper, we set the boundaries that egocentric vision models should consider for realistic applications, defining a novel setting of egocentric action recognition in the wild, which encourages researchers to develop novel, applications-aware solutions. We also present a new model-agnostic technique that enables the rapid repurposing of existing architectures in this new context, demonstrating the feasibility to deploy a model on a tiny device (Jetson Nano) and to perform the task directly on the edge with very low energy consumption (2.4W on average at 50 fps).
translated by 谷歌翻译
在本报告中,我们描述了我们提交给Epic-Kitchens-100无监督的域适应(UDA)挑战的技术细节。为了应对UDA设置下存在的域移位,我们首先利用了最近的域概括(DG)技术,称为相对规范对准(RNA)。其次,我们将这种方法扩展到无标记的目标数据工作,从而使模型更简单地以无监督的方式适应目标分布。为此,我们将UDA算法包括在内,例如多级对抗对准和专心熵。通过分析挑战设置,我们注意到数据中存在二次并发转移,通常称为环境偏见。它是由存在不同环境(即厨房)引起的。为了处理这两个班次(环境和时间段),我们扩展了系统以执行多源多目标域的适应性。最后,我们在最终提案中采用了不同的模型来利用流行视频体系结构的潜力,并为合奏改编介绍了两次损失。我们的提交(条目“ PLNET”)在排行榜上可见,并在“动词”中排名第二,并且在“名词”和“ Action”中都处于第三位。
translated by 谷歌翻译
物理模拟器在安全,不受约束的环境中方便学习加强学习政策表现出了巨大的希望。但是,由于现实差距,将获得的知识转移到现实世界可能会具有挑战性。为此,最近已经提出了几种方法来自动调整具有后验分布的实际数据,以在训练时与域随机化一起使用。这些方法已被证明在不同的设置和假设下适用于各种机器人任务。然而,现有文献缺乏对转移性能和实际数据效率的现有自适应域随机方法的详尽比较。在这项工作中,我们为离线和在线方法(Simopt,Bayrn,Droid,Dropo)提供了一个开放的基准,以阐明最适合每个设置和手头的任务。我们发现,在线方法受到下一次迭代的当前学会策略的质量受到限制,而离线方法有时可能会在使用开环命令中模拟中重播轨迹时失败。所使用的代码将在https://github.com/gabrieletiboni/adr-benchmark上发布。
translated by 谷歌翻译
Background: Image analysis applications in digital pathology include various methods for segmenting regions of interest. Their identification is one of the most complex steps, and therefore of great interest for the study of robust methods that do not necessarily rely on a machine learning (ML) approach. Method: A fully automatic and optimized segmentation process for different datasets is a prerequisite for classifying and diagnosing Indirect ImmunoFluorescence (IIF) raw data. This study describes a deterministic computational neuroscience approach for identifying cells and nuclei. It is far from the conventional neural network approach, but it is equivalent to their quantitative and qualitative performance, and it is also solid to adversative noise. The method is robust, based on formally correct functions, and does not suffer from tuning on specific data sets. Results: This work demonstrates the robustness of the method against the variability of parameters, such as image size, mode, and signal-to-noise ratio. We validated the method on two datasets (Neuroblastoma and NucleusSegData) using images annotated by independent medical doctors. Conclusions: The definition of deterministic and formally correct methods, from a functional to a structural point of view, guarantees the achievement of optimized and functionally correct results. The excellent performance of our deterministic method (NeuronalAlg) to segment cells and nuclei from fluorescence images was measured with quantitative indicators and compared with those achieved by three published ML approaches.
translated by 谷歌翻译
The broad usage of mobile devices nowadays, the sensitiveness of the information contained in them, and the shortcomings of current mobile user authentication methods are calling for novel, secure, and unobtrusive solutions to verify the users' identity. In this article, we propose TypeFormer, a novel Transformer architecture to model free-text keystroke dynamics performed on mobile devices for the purpose of user authentication. The proposed model consists in Temporal and Channel Modules enclosing two Long Short-Term Memory (LSTM) recurrent layers, Gaussian Range Encoding (GRE), a multi-head Self-Attention mechanism, and a Block-Recurrent structure. Experimenting on one of the largest public databases to date, the Aalto mobile keystroke database, TypeFormer outperforms current state-of-the-art systems achieving Equal Error Rate (EER) values of 3.25% using only 5 enrolment sessions of 50 keystrokes each. In such way, we contribute to reducing the traditional performance gap of the challenging mobile free-text scenario with respect to its desktop and fixed-text counterparts. Additionally, we analyse the behaviour of the model with different experimental configurations such as the length of the keystroke sequences and the amount of enrolment sessions, showing margin for improvement with more enrolment data. Finally, a cross-database evaluation is carried out, demonstrating the robustness of the features extracted by TypeFormer in comparison with existing approaches.
translated by 谷歌翻译
Digital media have enabled the access to unprecedented literary knowledge. Authors, readers, and scholars are now able to discover and share an increasing amount of information about books and their authors. Notwithstanding, digital archives are still unbalanced: writers from non-Western countries are less represented, and such a condition leads to the perpetration of old forms of discrimination. In this paper, we present the Under-Represented Writers Knowledge Graph (URW-KG), a resource designed to explore and possibly amend this lack of representation by gathering and mapping information about works and authors from Wikidata and three other sources: Open Library, Goodreads, and Google Books. The experiments based on KG embeddings showed that the integrated information encoded in the graph allows scholars and users to be more easily exposed to non-Western literary works and authors with respect to Wikidata alone. This opens to the development of fairer and effective tools for author discovery and exploration.
translated by 谷歌翻译
Content-Controllable Summarization generates summaries focused on the given controlling signals. Due to the lack of large-scale training corpora for the task, we propose a plug-and-play module RelAttn to adapt any general summarizers to the content-controllable summarization task. RelAttn first identifies the relevant content in the source documents, and then makes the model attend to the right context by directly steering the attention weight. We further apply an unsupervised online adaptive parameter searching algorithm to determine the degree of control in the zero-shot setting, while such parameters are learned in the few-shot setting. By applying the module to three backbone summarization models, experiments show that our method effectively improves all the summarizers, and outperforms the prefix-based method and a widely used plug-and-play model in both zero- and few-shot settings. Tellingly, more benefit is observed in the scenarios when more control is needed.
translated by 谷歌翻译
Anticipating future actions based on video observations is an important task in video understanding, which would be useful for some precautionary systems that require response time to react before an event occurs. Since the input in action anticipation is only pre-action frames, models do not have enough information about the target action; moreover, similar pre-action frames may lead to different futures. Consequently, any solution using existing action recognition models can only be suboptimal. Recently, researchers have proposed using a longer video context to remedy the insufficient information in pre-action intervals, as well as the self-attention to query past relevant moments to address the anticipation problem. However, the indirect use of video input features as the query might be inefficient, as it only serves as the proxy to the anticipation goal. To this end, we propose an inductive attention model, which transparently uses prior prediction as the query to derive the anticipation result by induction from past experience. Our method naturally considers the uncertainty of multiple futures via the many-to-many association. On the large-scale egocentric video datasets, our model not only shows consistently better performance than state of the art using the same backbone, and is competitive to the methods that employ a stronger backbone, but also superior efficiency in less model parameters.
translated by 谷歌翻译
The proliferation of radical online communities and their violent offshoots has sparked great societal concern. However, the current practice of banning such communities from mainstream platforms has unintended consequences: (I) the further radicalization of their members in fringe platforms where they migrate; and (ii) the spillover of harmful content from fringe back onto mainstream platforms. Here, in a large observational study on two banned subreddits, r/The\_Donald and r/fatpeoplehate, we examine how factors associated with the RECRO radicalization framework relate to users' migration decisions. Specifically, we quantify how these factors affect users' decisions to post on fringe platforms and, for those who do, whether they continue posting on the mainstream platform. Our results show that individual-level factors, those relating to the behavior of users, are associated with the decision to post on the fringe platform. Whereas social-level factors, users' connection with the radical community, only affect the propensity to be coactive on both platforms. Overall, our findings pave the way for evidence-based moderation policies, as the decisions to migrate and remain coactive amplify unintended consequences of community bans.
translated by 谷歌翻译
IoT sensors, especially video cameras, are ubiquitously deployed around the world to perform a variety of computer vision tasks in several verticals including retail, healthcare, safety and security, transportation, manufacturing, etc. To amortize their high deployment effort and cost, it is desirable to perform multiple video analytics tasks, which we refer to as Analytical Units (AUs), off the video feed coming out of every camera. In this paper, we first show that in a multi-AU setting, changing the camera setting has disproportionate impact on different AUs performance. In particular, the optimal setting for one AU may severely degrade the performance for another AU, and further the impact on different AUs varies as the environmental condition changes. We then present Elixir, a system to enhance the video stream quality for multiple analytics on a video stream. Elixir leverages Multi-Objective Reinforcement Learning (MORL), where the RL agent caters to the objectives from different AUs and adjusts the camera setting to simultaneously enhance the performance of all AUs. To define the multiple objectives in MORL, we develop new AU-specific quality estimator values for each individual AU. We evaluate Elixir through real-world experiments on a testbed with three cameras deployed next to each other (overlooking a large enterprise parking lot) running Elixir and two baseline approaches, respectively. Elixir correctly detects 7.1% (22,068) and 5.0% (15,731) more cars, 94% (551) and 72% (478) more faces, and 670.4% (4975) and 158.6% (3507) more persons than the default-setting and time-sharing approaches, respectively. It also detects 115 license plates, far more than the time-sharing approach (7) and the default setting (0).
translated by 谷歌翻译